Part 2: Advanced Data Visualization¶
This phase focuses on enhancing the dataset by creating meaningful features that capture temporal patterns, store-specific behaviors, and promotional effects. It also includes visual exploration to uncover trends, seasonality, and anomalies—laying the groundwork for robust forecasting models.
1. Setup & Imports Libraries¶
In [1]:
import time
In [2]:
# Step 1: Setup & Imports Libraries
print("Step 1: Setup and Import Libraries started...")
time.sleep(1) # Simulate processing time
Step 1: Setup and Import Libraries started...
In [3]:
# Data Manipulation & Processing
import math
import numpy as np
import pandas as pd
import scipy.stats as stats
from datetime import datetime
from sklearn.preprocessing import *
# Data Visualization
import seaborn as sbn
import matplotlib.pyplot as plt
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pandas.plotting import scatter_matrix
# to ensure Plotly works in both Jupyter and HTML export
pio.renderers.default = "notebook+plotly_mimetype"
from scipy.stats import ttest_ind
from scipy.stats import pearsonr
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler
sbn.set(rc={'figure.figsize':(14,6)})
plt.style.use('seaborn-v0_8')
sbn.set_palette("husl")
# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)
# Warnings
import warnings
warnings.simplefilter('ignore')
warnings.filterwarnings('ignore')
In [4]:
print("="*60)
print("Rossman Store Sales Time Series Analysis - Part 2")
print("="*60)
print("All libraries imported successfully!")
print("Analysis Date:", pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'))
============================================================ Rossman Store Sales Time Series Analysis - Part 2 ============================================================ All libraries imported successfully! Analysis Date: 2025-08-12 18:46:55
In [5]:
print("✅ Setup and Import Liraries completed.\n")
✅ Setup and Import Liraries completed.
In [6]:
# Start analysis
part_2_begin = pd.Timestamp.now()
bold_start = '\033[1m'
bold_end = '\033[0m'
print("🔍 Part 2 Started ...")
print(f"🟢 Begin Date: {bold_start}{part_2_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}\n")
🔍 Part 2 Started ...
🟢 Begin Date: 2025-08-12 18:46:55
Restore the file¶
In [7]:
%store -r df_features
View or Display Dataset¶
In [8]:
df_features.head()
Out[8]:
| store | dayofweek | date | sales | customers | open | promo | stateholiday | schoolholiday | isholiday | isschoolDay | day | week | month | quarter | year | isweekend | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 982643 | 1115 | 2 | 2013-01-01 | 0 | 0 | 0 | No Promo | Public | 1 | True | False | Tue | 1 | Jan | 1 | 2013 | False |
| 982640 | 1112 | 2 | 2013-01-01 | 0 | 0 | 0 | No Promo | Public | 1 | True | False | Tue | 1 | Jan | 1 | 2013 | False |
| 982639 | 1111 | 2 | 2013-01-01 | 0 | 0 | 0 | No Promo | Public | 1 | True | False | Tue | 1 | Jan | 1 | 2013 | False |
| 982638 | 1110 | 2 | 2013-01-01 | 0 | 0 | 0 | No Promo | Public | 1 | True | False | Tue | 1 | Jan | 1 | 2013 | False |
| 982637 | 1109 | 2 | 2013-01-01 | 0 | 0 | 0 | No Promo | Public | 1 | True | False | Tue | 1 | Jan | 1 | 2013 | False |
In [9]:
print("✅ Data Engineering completed.\n")
✅ Data Engineering completed.
Correlation Analysis¶
In [10]:
def enhanced_correlation_analysis(df, target_col='sales', figsize=(16, 12)):
"""
Comprehensive correlation analysis with business insights
Parameters:
-----------
df : pandas.DataFrame
Input dataframe
target_col : str
Primary variable of interest (usually 'sales')
figsize : tuple
Figure size for plots
"""
# Get numeric columns only
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# Remove redundant or non-meaningful columns
exclude_patterns = ['unnamed', 'index', 'id', 'level_0']
numeric_cols = [col for col in numeric_cols if not any(pattern in col.lower() for pattern in exclude_patterns)]
print(f"🔍 CORRELATION ANALYSIS")
print(f"="*50)
print(f"📊 Analyzing {len(numeric_cols)} numeric variables")
print(f"🎯 Primary target: {target_col}")
print(f"📋 Variables: {numeric_cols}")
# Calculate correlation matrix
corr_data = df[numeric_cols].corr()
# Create enhanced visualizations
fig = plt.figure(figsize=figsize)
gs = fig.add_gridspec(3, 3, height_ratios=[2, 1, 1], width_ratios=[2, 1, 1],
hspace=0.3, wspace=0.3)
# 1. Main correlation heatmap
ax1 = fig.add_subplot(gs[0, 0])
# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr_data, dtype=bool))
# Custom colormap
cmap = sbn.diverging_palette(230, 20, as_cmap=True)
# Create heatmap
sbn.heatmap(corr_data,
mask=mask,
annot=True,
cmap=cmap,
center=0,
square=True,
fmt='.2f',
cbar_kws={"shrink": .8},
ax=ax1,
annot_kws={'size': 9})
ax1.set_title('📊 Correlation Matrix Heatmap\n(Lower Triangle Only)',
fontsize=14, fontweight='bold', pad=20)
ax1.set_xlabel('')
ax1.set_ylabel('')
# Rotate labels for better readability
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')
plt.setp(ax1.get_yticklabels(), rotation=0)
# 2. Target variable correlations (sorted)
ax2 = fig.add_subplot(gs[0, 1])
if target_col in corr_data.columns:
target_corr = corr_data[target_col].drop(target_col).sort_values(key=abs, ascending=True)
# Color bars based on correlation strength
colors = ['red' if x < -0.5 else 'orange' if x < -0.2 else
'lightgray' if abs(x) < 0.2 else 'lightblue' if x < 0.5 else 'green'
for x in target_corr.values]
bars = ax2.barh(range(len(target_corr)), target_corr.values, color=colors, alpha=0.7)
ax2.set_yticks(range(len(target_corr)))
ax2.set_yticklabels([f'{label}' for label in target_corr.index], fontsize=9)
ax2.set_xlabel('Correlation with Sales')
ax2.set_title(f'🎯 Correlations with {target_col.title()}\n(Sorted by Strength)',
fontsize=12, fontweight='bold')
ax2.axvline(x=0, color='black', linestyle='-', alpha=0.3)
ax2.grid(axis='x', alpha=0.3)
# Add value labels on bars
for i, (bar, value) in enumerate(zip(bars, target_corr.values)):
ax2.text(value + (0.02 if value > 0 else -0.02), i, f'{value:.2f}',
va='center', ha='left' if value > 0 else 'right', fontsize=8)
# 3. Correlation strength distribution
ax3 = fig.add_subplot(gs[0, 2])
# Get upper triangle correlations (excluding diagonal)
upper_tri = corr_data.where(np.triu(np.ones(corr_data.shape), k=1).astype(bool))
correlations = upper_tri.stack().values
# Create histogram
ax3.hist(correlations, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
ax3.axvline(x=0, color='red', linestyle='--', alpha=0.7, label='Zero Correlation')
ax3.axvline(x=correlations.mean(), color='orange', linestyle='--', alpha=0.7,
label=f'Mean: {correlations.mean():.2f}')
ax3.set_xlabel('Correlation Coefficient')
ax3.set_ylabel('Frequency')
ax3.set_title('📈 Distribution of\nCorrelation Coefficients', fontsize=12, fontweight='bold')
ax3.legend(fontsize=8)
ax3.grid(alpha=0.3)
# 4. Strong correlations table
ax4 = fig.add_subplot(gs[1, :])
ax4.axis('off')
# Find strong correlations (|r| > 0.5)
strong_corr = []
for i in range(len(corr_data.columns)):
for j in range(i+1, len(corr_data.columns)):
corr_val = corr_data.iloc[i, j]
if abs(corr_val) > 0.5:
strong_corr.append({
'Variable 1': corr_data.columns[i],
'Variable 2': corr_data.columns[j],
'Correlation': corr_val,
'Strength': 'Very Strong' if abs(corr_val) > 0.8 else 'Strong'
})
if strong_corr:
strong_df = pd.DataFrame(strong_corr)
strong_df = strong_df.sort_values('Correlation', key=abs, ascending=False)
# Create table with better spacing and positioning
table_data = []
for _, row in strong_df.head(6).iterrows(): # Show top 6 to avoid crowding
# Truncate long variable names
var1 = row['Variable 1'][:15] + '...' if len(row['Variable 1']) > 15 else row['Variable 1']
var2 = row['Variable 2'][:15] + '...' if len(row['Variable 2']) > 15 else row['Variable 2']
table_data.append([
var1,
var2,
f"{row['Correlation']:.3f}",
row['Strength']
])
table = ax4.table(cellText=table_data,
colLabels=['Variable 1', 'Variable 2', 'Correlation', 'Strength'],
cellLoc='center',
loc='upper center', # Changed position
colWidths=[0.28, 0.28, 0.18, 0.18], # Slightly wider columns
bbox=[0.05, 0.15, 0.9, 0.7]) # Larger table area
table.auto_set_font_size(False)
table.set_fontsize(10) # Increased font size
table.scale(1, 1.8) # Increased scaling for better readability
# Style the header
for j in range(4):
table[(0, j)].set_facecolor('#4472C4')
table[(0, j)].set_text_props(weight='bold', color='white')
table[(0, j)].set_height(0.12) # Taller header
# Color code the data rows
for i in range(len(table_data)):
corr_val = float(table_data[i][2])
if abs(corr_val) > 0.8:
color = '#ffcccc' if corr_val < 0 else '#ccffcc'
else:
color = '#ffe6cc' if corr_val < 0 else '#e6f3ff'
for j in range(4):
table[(i+1, j)].set_facecolor(color)
table[(i+1, j)].set_height(0.10) # Taller data cells
# Position title better
ax4.text(0.5, 0.95, '🔍 Strong Correlations (|r| > 0.5)',
ha='center', va='top', fontsize=12, fontweight='bold',
transform=ax4.transAxes)
else:
ax4.text(0.5, 0.5, 'No strong correlations found (|r| > 0.5)',
ha='center', va='center', fontsize=12, style='italic',
transform=ax4.transAxes)
ax4.text(0.5, 0.9, '🔍 Strong Correlations',
ha='center', va='top', fontsize=12, fontweight='bold',
transform=ax4.transAxes)
# 5. Business insights text
ax5 = fig.add_subplot(gs[2, :])
ax5.axis('off')
# Generate insights
insights = []
if target_col in corr_data.columns:
target_correlations = corr_data[target_col].drop(target_col)
# Strongest positive correlation
max_pos = target_correlations.max()
max_pos_var = target_correlations.idxmax()
# Strongest negative correlation
max_neg = target_correlations.min()
max_neg_var = target_correlations.idxmin()
# Count significant correlations
significant_count = (abs(target_correlations) > 0.3).sum()
insights.append(f"🎯 SALES DRIVERS: '{max_pos_var}' has the strongest positive relationship with sales (r={max_pos:.3f})")
if max_neg < -0.2:
insights.append(f"⚠️ NEGATIVE FACTOR: '{max_neg_var}' shows negative correlation with sales (r={max_neg:.3f})")
insights.append(f"📊 SIGNIFICANCE: {significant_count}/{len(target_correlations)} variables show meaningful correlation with sales (|r|>0.3)")
if len(strong_corr) > 0:
insights.append(f"🔗 MULTICOLLINEARITY: Found {len(strong_corr)} strong inter-variable relationships - consider for feature selection")
# Interpretation guide
insights.append(f"📈 INTERPRETATION: |r| > 0.7 (Strong), |r| 0.3-0.7 (Moderate), |r| < 0.3 (Weak)")
# Display insights
insight_text = '\n'.join(insights)
ax5.text(0.05, 0.95, '💡 KEY INSIGHTS:', fontsize=12, fontweight='bold',
transform=ax5.transAxes, va='top')
ax5.text(0.05, 0.85, insight_text, fontsize=10, transform=ax5.transAxes,
va='top', wrap=True, linespacing=1.5)
plt.suptitle('🔍 COMPREHENSIVE CORRELATION ANALYSIS', fontsize=16, fontweight='bold', y=0.98)
# Print detailed analysis
print(f"\n📋 DETAILED CORRELATION ANALYSIS")
print(f"="*50)
if target_col in corr_data.columns:
print(f"\n🎯 TOP CORRELATIONS WITH {target_col.upper()}:")
target_sorted = corr_data[target_col].drop(target_col).sort_values(key=abs, ascending=False)
for var, corr in target_sorted.head(5).items():
strength = "Very Strong" if abs(corr) > 0.8 else "Strong" if abs(corr) > 0.6 else "Moderate" if abs(corr) > 0.3 else "Weak"
direction = "Positive" if corr > 0 else "Negative"
print(f" {var:15} | {corr:+.3f} | {strength:12} | {direction}")
if strong_corr:
print(f"\n🔗 STRONG INTER-VARIABLE CORRELATIONS:")
for item in strong_df.head(5).to_dict('records'):
print(f" {item['Variable 1']} ↔ {item['Variable 2']}: {item['Correlation']:+.3f} ({item['Strength']})")
print(f"\n📊 SUMMARY STATISTICS:")
print(f" • Mean correlation: {correlations.mean():.3f}")
print(f" • Std correlation: {correlations.std():.3f}")
print(f" • Max correlation: {correlations.max():.3f}")
print(f" • Min correlation: {correlations.min():.3f}")
print(f" • Strong correlations (|r|>0.5): {len(strong_corr)}")
plt.tight_layout()
plt.show()
# Return analysis results
return {
'correlation_matrix': corr_data,
'target_correlations': corr_data[target_col].drop(target_col).sort_values(key=abs, ascending=False) if target_col in corr_data.columns else None,
'strong_correlations': pd.DataFrame(strong_corr) if strong_corr else pd.DataFrame(),
'summary_stats': {
'mean_correlation': correlations.mean(),
'std_correlation': correlations.std(),
'max_correlation': correlations.max(),
'min_correlation': correlations.min(),
'strong_count': len(strong_corr)
}
}
def correlation_feature_selector(df, target_col='sales', threshold=0.3, remove_multicollinear=True):
"""
Select features based on correlation analysis
Parameters:
-----------
df : pandas.DataFrame
Input dataframe
target_col : str
Target variable
threshold : float
Minimum correlation threshold with target
remove_multicollinear : bool
Whether to remove highly correlated features
"""
print(f"🎯 FEATURE SELECTION BASED ON CORRELATION")
print(f"="*50)
# Get numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
if target_col not in numeric_cols:
print(f"❌ Error: {target_col} not found in numeric columns")
return None
# Calculate correlations with target
corr_matrix = df[numeric_cols].corr()
target_corr = corr_matrix[target_col].drop(target_col)
# Select features based on correlation threshold
selected_features = target_corr[abs(target_corr) >= threshold].index.tolist()
print(f"📊 Initial features meeting threshold (|r| >= {threshold}): {len(selected_features)}")
for feat in selected_features:
print(f" {feat:20} | {target_corr[feat]:+.3f}")
if remove_multicollinear and len(selected_features) > 1:
print(f"\n🔗 Checking for multicollinearity...")
# Check correlations among selected features
selected_corr = corr_matrix.loc[selected_features, selected_features]
# Remove highly correlated features (keep the one most correlated with target)
to_remove = set()
for i in range(len(selected_features)):
for j in range(i+1, len(selected_features)):
feat1, feat2 = selected_features[i], selected_features[j]
if abs(selected_corr.loc[feat1, feat2]) > 0.8: # High correlation threshold
# Keep the feature more correlated with target
if abs(target_corr[feat1]) >= abs(target_corr[feat2]):
to_remove.add(feat2)
print(f" Removing {feat2} (corr with {feat1}: {selected_corr.loc[feat1, feat2]:.3f})")
else:
to_remove.add(feat1)
print(f" Removing {feat1} (corr with {feat2}: {selected_corr.loc[feat1, feat2]:.3f})")
# Final feature list
final_features = [f for f in selected_features if f not in to_remove]
print(f"\n✅ Final selected features: {len(final_features)}")
for feat in final_features:
print(f" {feat:20} | {target_corr[feat]:+.3f}")
return final_features
return selected_features
# Feature selection based on correlation
important_features = correlation_feature_selector(df_features, target_col='sales', threshold=0.3)
🎯 FEATURE SELECTION BASED ON CORRELATION ================================================== 📊 Initial features meeting threshold (|r| >= 0.3): 3 dayofweek | -0.461 customers | +0.896 open | +0.679 🔗 Checking for multicollinearity... ✅ Final selected features: 3 dayofweek | -0.461 customers | +0.896 open | +0.679
Enhnaced Correlation analysis¶
In [11]:
# Enhanced correlation analysis
results = enhanced_correlation_analysis(df_features, target_col='sales')
🔍 CORRELATION ANALYSIS ================================================== 📊 Analyzing 8 numeric variables 🎯 Primary target: sales 📋 Variables: ['store', 'dayofweek', 'sales', 'customers', 'open', 'week', 'quarter', 'year'] 📋 DETAILED CORRELATION ANALYSIS ================================================== 🎯 TOP CORRELATIONS WITH SALES: customers | +0.896 | Very Strong | Positive open | +0.679 | Strong | Positive dayofweek | -0.461 | Moderate | Negative week | +0.052 | Weak | Positive quarter | +0.043 | Weak | Positive 🔗 STRONG INTER-VARIABLE CORRELATIONS: week ↔ quarter: +0.942 (Very Strong) sales ↔ customers: +0.896 (Very Strong) sales ↔ open: +0.679 (Strong) customers ↔ open: +0.618 (Strong) dayofweek ↔ open: -0.527 (Strong) 📊 SUMMARY STATISTICS: • Mean correlation: 0.049 • Std correlation: 0.343 • Max correlation: 0.942 • Min correlation: -0.527 • Strong correlations (|r|>0.5): 5
Focused correlation matrix with only important features¶
In [12]:
# Focused correlation matrix with only important features
if important_features:
important_features.append('sales') # Add target back
focused_results = enhanced_correlation_analysis(df_features[important_features], target_col='sales', figsize=(12, 10))
🔍 CORRELATION ANALYSIS ================================================== 📊 Analyzing 4 numeric variables 🎯 Primary target: sales 📋 Variables: ['dayofweek', 'customers', 'open', 'sales'] 📋 DETAILED CORRELATION ANALYSIS ================================================== 🎯 TOP CORRELATIONS WITH SALES: customers | +0.896 | Very Strong | Positive open | +0.679 | Strong | Positive dayofweek | -0.461 | Moderate | Negative 🔗 STRONG INTER-VARIABLE CORRELATIONS: customers ↔ sales: +0.896 (Very Strong) open ↔ sales: +0.679 (Strong) customers ↔ open: +0.618 (Strong) dayofweek ↔ open: -0.527 (Strong) 📊 SUMMARY STATISTICS: • Mean correlation: 0.136 • Std correlation: 0.602 • Max correlation: 0.896 • Min correlation: -0.527 • Strong correlations (|r|>0.5): 4
3.2 Percentage Distribution per Holiday Type¶
In [13]:
# Get value counts efficiently
value_counts = df_features['stateholiday'].value_counts()
values = value_counts.values
labels = value_counts.index.tolist()
# Use original labels since holidaytype is already mapped from stateholiday
readable_labels = [str(label) for label in labels]
# Create pie chart with enhanced styling
fig = go.Figure(data=go.Pie(
values=values,
labels=readable_labels,
pull=[0.1 if i == 0 else 0 for i in range(len(values))], # Explode largest slice
hole=0.3, # Donut chart for better readability
textinfo='label+percent',
textposition='outside',
textfont=dict(size=10),
insidetextorientation='radial',
marker=dict(
colors=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7'],
line=dict(color='#FFFFFF', width=2)
)
))
# Enhanced layout
fig.update_layout(
title={
'text': 'Holiday Type Distribution',
'x': 0.02, # Position on the left side
'xanchor': 'left',
'y': 0.95, # Position at the top
'yanchor': 'top',
'font': {'size': 18, 'family': 'Arial, sans-serif'}
},
font=dict(size=12),
showlegend=True,
legend=dict(
orientation="v",
yanchor="middle",
y=0.5,
xanchor="left",
x=1.01
),
margin=dict(t=80, b=60, l=80, r=180),
width=800,
height=500
)
# Add summary statistics positioned to avoid obstruction
total_records = len(df_features)
holiday_records = len(df_features[df_features['stateholiday'] != 0])
holiday_percentage = (holiday_records / total_records) * 100
fig.add_annotation(
text=f"Total Records: {total_records:,}<br>Holiday Records: {holiday_records:,} ({holiday_percentage:.1f}%)",
xref="paper", yref="paper",
x=0.02, y=0.02, # Bottom left corner
showarrow=False,
font=dict(size=10, color="gray"),
align="left",
bgcolor="rgba(255,255,255,0.8)", # Semi-transparent background
bordercolor="gray",
borderwidth=1
)
fig.show(config={'displayModeBar': True, 'displaylogo': False})
# Display value counts for reference
print("\nHoliday Type Value Counts:")
print("-" * 30)
for label, count in zip(readable_labels, values):
percentage = (count / total_records) * 100
print(f"{label}: {count:,} ({percentage:.2f}%)")
print(f"\nTotal: {total_records:,} records")
Holiday Type Value Counts: ------------------------------ Normal Day: 951,594 (96.84%) Public: 20,260 (2.06%) Easter: 6,690 (0.68%) Christmas: 4,100 (0.42%) Total: 982,644 records
3.3 Percentage Distribution per School Holiday¶
In [14]:
import plotly.graph_objects as go
import pandas as pd
# Get value counts and create labels more efficiently
value_counts = df_features['schoolholiday'].value_counts()
values = value_counts.values
labels = ['School Holiday' if x == 1 else 'No School Holiday' for x in value_counts.index]
# Create pie chart with improved styling
fig = go.Figure(data=go.Pie(
values=values,
labels=labels,
pull=[0.1, 0], # Explode first slice
hole=0.3, # Creates a donut chart for better readability
textinfo='label+percent',
textposition='auto',
marker=dict(
colors=['#FF6B6B', '#4ECDC4'], # Custom colors
line=dict(color='#FFFFFF', width=2)
)
))
# Enhanced layout with better formatting
fig.update_layout(
title={
'text': 'School Holiday Distribution',
'x': 0.5, # Center the title
'xanchor': 'center',
'font': {'size': 18, 'family': 'Arial, sans-serif'}
},
font=dict(size=12),
showlegend=True,
legend=dict(
orientation="v",
yanchor="middle",
y=0.5,
xanchor="left",
x=1.01
),
margin=dict(t=60, b=40, l=40, r=120),
width=600,
height=400
)
# Add annotations for better context
total_records = len(df_features)
fig.add_annotation(
text=f"Total Records: {total_records:,}",
xref="paper", yref="paper",
x=0.5, y=-0.1,
showarrow=False,
font=dict(size=10, color="gray")
)
fig.show(config={'displayModeBar': True, 'displaylogo': False})
3.4 Sales Distribution¶
In [15]:
# Create histogram bins more efficiently
max_sales = df_features['sales'].max()
bin_width = 550
bins = np.arange(0, max_sales + bin_width, bin_width)
# Calculate histogram
counts, bin_edges = np.histogram(df_features['sales'], bins=bins)
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
# Create improved bar chart
fig = px.bar(
x=bin_centers,
y=counts,
labels={
'x': 'Sales Value',
'y': 'Frequency',
'color': 'Count'
},
title='Sales Distribution',
color=counts, # Color gradient based on frequency
color_continuous_scale='viridis'
)
# Enhanced styling
fig.update_layout(
title={
'text': 'Sales Distribution Analysis',
'x': 0.02,
'xanchor': 'left',
'y': 0.95,
'yanchor': 'top',
'font': {'size': 18, 'family': 'Arial, sans-serif'}
},
xaxis_title='Sales Value',
yaxis_title='Frequency',
bargap=0.02, # Minimal gap for histogram effect
font=dict(size=12),
plot_bgcolor='white',
width=1300,
height=600,
margin=dict(t=80, b=60, l=60, r=60),
# FIXED: Proper axis formatting for px.bar figures
xaxis=dict(
tickformat=',', # Add thousands separator
showgrid=True,
gridwidth=1,
gridcolor='lightgray'
),
yaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray'
)
)
# Improve trace styling
fig.update_traces(
marker_line_width=0.5,
marker_line_color='white'
)
# Add statistical annotations
mean_sales = df_features['sales'].mean()
median_sales = df_features['sales'].median()
std_sales = df_features['sales'].std()
fig.add_annotation(
text=f"Mean: {mean_sales:,.0f}<br>Median: {median_sales:,.0f}<br>Std Dev: {std_sales:,.0f}",
xref="paper", yref="paper",
x=0.98, y=0.98,
xanchor="right", yanchor="top",
showarrow=False,
font=dict(size=10, color="black"),
align="right",
bgcolor="rgba(255,255,255,0.9)",
bordercolor="gray",
borderwidth=1,
borderpad=8
)
# Add vertical lines for mean and median
fig.add_vline(
x=mean_sales,
line_dash="dash",
line_color="red",
annotation_text="Mean",
annotation_position="top"
)
fig.add_vline(
x=median_sales,
line_dash="dot",
line_color="blue",
annotation_text="Median",
annotation_position="top"
)
fig.show(config={'displayModeBar': True, 'displaylogo': False})
# Print summary statistics
print("\nSales Distribution Summary:")
print("-" * 40)
print(f"Total Records: {len(df_features):,}")
print(f"Min Sales: {df_features['sales'].min():,}")
print(f"Max Sales: {df_features['sales'].max():,}")
print(f"Mean Sales: {mean_sales:,.2f}")
print(f"Median Sales: {median_sales:,.2f}")
print(f"Standard Deviation: {std_sales:,.2f}")
print(f"Bin Width: {bin_width}")
print(f"Number of Bins: {len(counts)}")
Sales Distribution Summary: ---------------------------------------- Total Records: 982,644 Min Sales: 0 Max Sales: 41,551 Mean Sales: 5,760.84 Median Sales: 5,731.00 Standard Deviation: 3,857.57 Bin Width: 550 Number of Bins: 76
3.4 Customers Distribution¶
In [16]:
# Create histogram bins more efficiently
max_sales = df_features['customers'].max()
bin_width = 50
bins = np.arange(0, max_sales + bin_width, bin_width)
# Calculate histogram
counts, bin_edges = np.histogram(df_features['customers'], bins=bins)
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
# Create improved bar chart
fig = px.bar(
x=bin_centers,
y=counts,
labels={
'x': 'Customers Value',
'y': 'Frequency',
'color': 'Count'
},
title='Customers Distribution',
color=counts, # Color gradient based on frequency
color_continuous_scale='viridis'
)
# Enhanced styling
fig.update_layout(
title={
'text': 'Customers Distribution Analysis',
'x': 0.02,
'xanchor': 'left',
'y': 0.95,
'yanchor': 'top',
'font': {'size': 18, 'family': 'Arial, sans-serif'}
},
xaxis_title='Customers Value',
yaxis_title='Frequency',
bargap=0.02, # Minimal gap for histogram effect
font=dict(size=12),
plot_bgcolor='white',
width=1050,
height=400,
margin=dict(t=80, b=60, l=60, r=60),
# FIXED: Proper axis formatting for px.bar figures
xaxis=dict(
tickformat=',', # Add thousands separator
showgrid=True,
gridwidth=1,
gridcolor='lightgray'
),
yaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray'
)
)
# Improve trace styling
fig.update_traces(
marker_line_width=0.5,
marker_line_color='white'
)
# Add statistical annotations
mean_sales = df_features['customers'].mean()
median_sales = df_features['customers'].median()
std_sales = df_features['customers'].std()
fig.add_annotation(
text=f"Mean: {mean_sales:,.0f}<br>Median: {median_sales:,.0f}<br>Std Dev: {std_sales:,.0f}",
xref="paper", yref="paper",
x=0.98, y=0.98,
xanchor="right", yanchor="top",
showarrow=False,
font=dict(size=10, color="black"),
align="right",
bgcolor="rgba(255,255,255,0.9)",
bordercolor="gray",
borderwidth=1,
borderpad=8
)
# Add vertical lines for mean and median
fig.add_vline(
x=mean_sales,
line_dash="dash",
line_color="red",
annotation_text="Mean",
annotation_position="top"
)
fig.add_vline(
x=median_sales,
line_dash="dot",
line_color="blue",
annotation_text="Median",
annotation_position="top"
)
fig.show(config={'displayModeBar': True, 'displaylogo': False})
# Print summary statistics
print("\nCustomers Distribution Summary:")
print("-" * 40)
print(f"Total Records: {len(df_features):,}")
print(f"Min Customers: {df_features['customers'].min():,}")
print(f"Max Customers: {df_features['customers'].max():,}")
print(f"Mean Customers: {mean_sales:,.2f}")
print(f"Median Customers: {median_sales:,.2f}")
print(f"Standard Deviation: {std_sales:,.2f}")
print(f"Bin Width: {bin_width}")
print(f"Number of Bins: {len(counts)}")
Customers Distribution Summary: ---------------------------------------- Total Records: 982,644 Min Customers: 0 Max Customers: 7,388 Mean Customers: 632.77 Median Customers: 609.00 Standard Deviation: 465.40 Bin Width: 50 Number of Bins: 148
3.6 Customer Analysis¶
Average Customers Trend per Day¶
In [17]:
# Define weekday order
weekday_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# Calculate mean customers by day
dow_agg = df_features.groupby('day', as_index=False)[['customers']].mean()
# Apply categorical ordering
dow_agg['day'] = pd.Categorical(dow_agg['day'], categories=weekday_order, ordered=True)
dow_agg = dow_agg.sort_values('day')
# Identify peak and lowest days
peak_row = dow_agg.loc[dow_agg['customers'].idxmax()]
lowest_row = dow_agg.loc[dow_agg['customers'].idxmin()]
peak_day = peak_row['day']
peak_value = peak_row['customers']
lowest_day = lowest_row['day']
lowest_value = lowest_row['customers']
# Create enhanced line chart
fig = px.line(
dow_agg,
x='day',
y='customers',
title='Average Customer Traffic by Day of Week',
markers=True,
line_shape='spline' # Smooth curve
)
# Enhanced styling
fig.update_layout(
title={
'text': 'Average Customer Traffic by Day of Week',
'x': 0.02,
'xanchor': 'left',
'y': 0.95,
'yanchor': 'top',
'font': {'size': 18, 'family': 'Arial, sans-serif'}
},
xaxis_title='Day of Week',
yaxis_title='Average Number of Customers',
font=dict(size=12),
plot_bgcolor='white',
width=900,
height=500,
margin=dict(t=80, b=60, l=80, r=60),
xaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray'
),
yaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray',
tickformat=','
)
)
# Improve line and marker styling
fig.update_traces(
line=dict(width=3, color='#2E8B57'),
marker=dict(
size=8,
color='#FF6B6B',
line=dict(width=2, color='white')
)
)
# Add annotation for peak day
fig.add_annotation(
x=peak_day,
y=peak_value,
text=f"Peak: {peak_day}<br>({peak_value:,.0f} customers)",
showarrow=True,
arrowhead=2,
ax=0,
ay=-50,
font=dict(color="red", size=11),
bgcolor="rgba(255,255,255,0.9)",
bordercolor="red",
borderwidth=1
)
# Add annotation for lowest day
fig.add_annotation(
x=lowest_day,
y=lowest_value,
text=f"Lowest: {lowest_day}<br>({lowest_value:,.0f} customers)",
showarrow=True,
arrowhead=2,
ax=0,
ay=50,
font=dict(color="blue", size=11),
bgcolor="rgba(255,255,255,0.9)",
bordercolor="blue",
borderwidth=1
)
# Add summary statistics box
avg_customers = dow_agg['customers'].mean()
std_customers = dow_agg['customers'].std()
range_customers = peak_value - lowest_value
fig.add_annotation(
text=f"Weekly Average: {avg_customers:,.0f}<br>Standard Deviation: {std_customers:,.0f}<br>Range: {range_customers:,.0f}",
xref="paper", yref="paper",
x=0.98, y=0.98,
xanchor="right", yanchor="top",
showarrow=False,
font=dict(size=10, color="black"),
align="right",
bgcolor="rgba(255,255,255,0.9)",
bordercolor="gray",
borderwidth=1,
borderpad=8
)
fig.show(config={'displayModeBar': True, 'displaylogo': False})
# Print detailed summary
print("\nWeekday Customer Traffic Summary:")
print("-" * 45)
for _, row in dow_agg.iterrows():
day = row['day']
customers = row['customers']
percentage_of_peak = (customers / peak_value) * 100
print(f"{day}: {customers:,.0f} customers ({percentage_of_peak:.1f}% of peak)")
print(f"\nWeekly Statistics:")
print(f"Average daily customers: {avg_customers:,.0f}")
print(f"Peak day: {peak_day} ({peak_value:,.0f} customers)")
print(f"Lowest day: {lowest_day} ({lowest_value:,.0f} customers)")
print(f"Difference (Peak - Lowest): {range_customers:,.0f} customers")
Weekday Customer Traffic Summary: --------------------------------------------- Mon: 813 customers (100.0% of peak) Tue: 762 customers (93.7% of peak) Wed: 721 customers (88.7% of peak) Thu: 696 customers (85.6% of peak) Fri: 743 customers (91.3% of peak) Sat: 659 customers (81.0% of peak) Sun: 36 customers (4.4% of peak) Weekly Statistics: Average daily customers: 633 Peak day: Mon (813 customers) Lowest day: Sun (36 customers) Difference (Peak - Lowest): 777 customers
Average Customers Trend per Month¶
In [18]:
# Ensure month is ordered
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun','Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Calculate monthly averages
monthly_grp = df_features.groupby('month', as_index=False)['customers'].mean()
monthly_grp['month'] = pd.Categorical(monthly_grp['month'], categories=month_order, ordered=True)
monthly_grp = monthly_grp.sort_values('month')
# Identify peak and lowest months
peak_row = monthly_grp.loc[monthly_grp['customers'].idxmax()]
lowest_row = monthly_grp.loc[monthly_grp['customers'].idxmin()]
peak_month = peak_row['month']
peak_value = peak_row['customers']
lowest_month = lowest_row['month']
lowest_value = lowest_row['customers']
# Create enhanced line chart
fig = px.line(
monthly_grp,
x='month',
y='customers',
markers=True,
title='Average Customer Traffic by Month',
labels={'customers': 'Average Customers', 'month': 'Month'},
line_shape='spline' # Smooth curves for better trend visualization
)
# Enhanced styling
fig.update_layout(
title={
'text': 'Average Customer Traffic by Month',
'x': 0.02,
'xanchor': 'left',
'y': 0.95,
'yanchor': 'top',
'font': {'size': 18, 'family': 'Arial, sans-serif'}
},
xaxis_title='Month',
yaxis_title='Average Number of Customers',
font=dict(size=12),
plot_bgcolor='white',
width=1000,
height=550,
margin=dict(t=80, b=60, l=80, r=80),
xaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray'
),
yaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray',
tickformat=','
)
)
# Improve line and marker styling
fig.update_traces(
line=dict(width=3, color='#1f77b4'),
marker=dict(
size=10,
color='#ff7f0e',
line=dict(width=2, color='white')
)
)
# Annotate peak month
fig.add_annotation(
x=peak_month,
y=peak_value,
text=f'Peak: {peak_month}<br>({peak_value:,.0f} customers)',
showarrow=True,
arrowhead=2,
arrowsize=1,
arrowwidth=2,
arrowcolor='red',
font=dict(color='red', size=11),
yshift=20,
bgcolor="rgba(255,255,255,0.9)",
bordercolor="red",
borderwidth=1
)
# Annotate lowest month
fig.add_annotation(
x=lowest_month,
y=lowest_value,
text=f'Lowest: {lowest_month}<br>({lowest_value:,.0f} customers)',
showarrow=True,
arrowhead=2,
arrowsize=1,
arrowwidth=2,
arrowcolor='blue',
font=dict(color='blue', size=11),
yshift=-30,
bgcolor="rgba(255,255,255,0.9)",
bordercolor="blue",
borderwidth=1
)
# Add seasonal analysis
spring_months = ['Mar', 'Apr', 'May']
summer_months = ['Jun', 'Jul', 'Aug']
fall_months = ['Sep', 'Oct', 'Nov']
winter_months = ['Dec', 'Jan', 'Feb']
spring_avg = monthly_grp[monthly_grp['month'].isin(spring_months)]['customers'].mean()
summer_avg = monthly_grp[monthly_grp['month'].isin(summer_months)]['customers'].mean()
fall_avg = monthly_grp[monthly_grp['month'].isin(fall_months)]['customers'].mean()
winter_avg = monthly_grp[monthly_grp['month'].isin(winter_months)]['customers'].mean()
# Add summary statistics
yearly_avg = monthly_grp['customers'].mean()
yearly_std = monthly_grp['customers'].std()
range_customers = peak_value - lowest_value
fig.add_annotation(
text=f"Yearly Average: {yearly_avg:,.0f}<br>" +
f"Standard Deviation: {yearly_std:,.0f}<br>" +
f"Range: {range_customers:,.0f}<br><br>" +
f"Seasonal Averages:<br>" +
f"Spring: {spring_avg:,.0f}<br>" +
f"Summer: {summer_avg:,.0f}<br>" +
f"Fall: {fall_avg:,.0f}<br>" +
f"Winter: {winter_avg:,.0f}",
xref="paper", yref="paper",
x=0.02, y=0.5, # Middle left side
xanchor="left", yanchor="middle",
showarrow=False,
font=dict(size=10, color="black"),
align="left",
bgcolor="rgba(255,255,255,0.9)",
bordercolor="gray",
borderwidth=1,
borderpad=8
)
fig.show(config={'displayModeBar': True, 'displaylogo': False})
# Print detailed monthly summary
print("\nMonthly Customer Traffic Summary:")
print("-" * 50)
for _, row in monthly_grp.iterrows():
month = row['month']
customers = row['customers']
percentage_of_peak = (customers / peak_value) * 100
print(f"{month}: {customers:,.0f} customers ({percentage_of_peak:.1f}% of peak)")
print(f"\nYearly Statistics:")
print(f"Average monthly customers: {yearly_avg:,.0f}")
print(f"Peak month: {peak_month} ({peak_value:,.0f} customers)")
print(f"Lowest month: {lowest_month} ({lowest_value:,.0f} customers)")
print(f"Difference (Peak - Lowest): {range_customers:,.0f} customers")
print(f"\nSeasonal Analysis:")
print(f"Spring (Mar-May): {spring_avg:,.0f} customers")
print(f"Summer (Jun-Aug): {summer_avg:,.0f} customers")
print(f"Fall (Sep-Nov): {fall_avg:,.0f} customers")
print(f"Winter (Dec-Feb): {winter_avg:,.0f} customers")
Monthly Customer Traffic Summary: -------------------------------------------------- Jan: 602 customers (85.6% of peak) Feb: 627 customers (89.1% of peak) Mar: 629 customers (89.5% of peak) Apr: 631 customers (89.7% of peak) May: 602 customers (85.6% of peak) Jun: 625 customers (88.9% of peak) Jul: 664 customers (94.4% of peak) Aug: 642 customers (91.4% of peak) Sep: 634 customers (90.2% of peak) Oct: 631 customers (89.8% of peak) Nov: 654 customers (93.0% of peak) Dec: 703 customers (100.0% of peak) Yearly Statistics: Average monthly customers: 637 Peak month: Dec (703 customers) Lowest month: Jan (602 customers) Difference (Peak - Lowest): 101 customers Seasonal Analysis: Spring (Mar-May): 621 customers Summer (Jun-Aug): 644 customers Fall (Sep-Nov): 640 customers Winter (Dec-Feb): 644 customers
Top 10 Crowded Stores¶
In [19]:
# Group and get top 10 stores with highest average customers
top10_crowded_store = (
df_features.groupby('store', as_index=False)['customers']
.mean()
.nlargest(10, 'customers')
)
# Add ranking column for better visualization
top10_crowded_store['rank'] = range(1, 11)
top10_crowded_store['store_label'] = 'Store ' + top10_crowded_store['store'].astype(str)
# Create enhanced bar chart
fig = px.bar(
top10_crowded_store,
x='store_label',
y='customers',
title='Top 10 Highest Traffic Stores',
labels={'store_label': 'Store', 'customers': 'Average Daily Customers'},
color='customers', # Color gradient based on customer count
color_continuous_scale='viridis',
text='customers' # Show values on bars
)
# Enhanced styling
fig.update_layout(
title={
'text': 'Top 10 Highest Traffic Stores',
'x': 0.02,
'xanchor': 'left',
'y': 0.95,
'yanchor': 'top',
'font': {'size': 20, 'family': 'Arial, sans-serif'}
},
xaxis_title='Store',
yaxis_title='Average Daily Customers',
font=dict(size=12),
plot_bgcolor='white',
width=1200,
height=600,
margin=dict(t=100, b=80, l=80, r=80),
xaxis=dict(
showgrid=False,
tickangle=45 # Rotate labels for better readability
),
yaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray',
tickformat=','
),
showlegend=False # Hide color scale legend for cleaner look
)
# Format text on bars
fig.update_traces(
texttemplate='%{text:,.0f}',
textposition='outside',
textfont_size=11,
textfont_color='black',
marker_line_width=1,
marker_line_color='white'
)
# Add performance insights
top_store = top10_crowded_store.iloc[0]
bottom_store = top10_crowded_store.iloc[-1]
avg_of_top10 = top10_crowded_store['customers'].mean()
total_stores = df_features['store'].nunique()
# Add summary annotation
fig.add_annotation(
text=f"Performance Summary:<br>" +
f"#1 Store {top_store['store']}: {top_store['customers']:,.0f} customers<br>" +
f"#10 Store {bottom_store['store']}: {bottom_store['customers']:,.0f} customers<br>" +
f"Top 10 Average: {avg_of_top10:,.0f}<br>" +
f"Gap (1st - 10th): {top_store['customers'] - bottom_store['customers']:,.0f}<br>" +
f"Total Stores: {total_stores}",
xref="paper", yref="paper",
x=0.02, y=0.5, # Left center
xanchor="left", yanchor="middle",
showarrow=False,
font=dict(size=11, color="black"),
align="left",
bgcolor="rgba(255,255,255,0.95)",
bordercolor="gray",
borderwidth=1,
borderpad=10
)
# Add reference line for top 10 average
fig.add_hline(
y=avg_of_top10,
line_dash="dash",
line_color="red",
annotation_text=f"Top 10 Avg: {avg_of_top10:,.0f}",
annotation_position="top right"
)
fig.show(config={'displayModeBar': True, 'displaylogo': False})
# Enhanced data analysis
print("\nTop 10 Stores Performance Analysis:")
print("=" * 55)
print(f"{'Rank':<4} {'Store ID':<8} {'Avg Customers':<15} {'% of #1 Store':<12}")
print("-" * 55)
for i, row in top10_crowded_store.iterrows():
rank = row['rank']
store_id = row['store']
customers = row['customers']
pct_of_top = (customers / top_store['customers']) * 100
print(f"{rank:<4} {store_id:<8} {customers:>10,.0f} {pct_of_top:>8.1f}%")
print(f"\nSummary Statistics:")
print("-" * 25)
print(f"Total stores analyzed: {total_stores}")
print(f"Top 10 average: {avg_of_top10:,.0f} customers")
print(f"Performance range: {top_store['customers'] - bottom_store['customers']:,.0f} customers")
print(f"Standard deviation: {top10_crowded_store['customers'].std():,.0f}")
# Additional insights
overall_avg = df_features.groupby('store')['customers'].mean().mean()
top10_vs_overall = ((avg_of_top10 / overall_avg) - 1) * 100
print(f"\nComparative Analysis:")
print("-" * 20)
print(f"Overall store average: {overall_avg:,.0f} customers")
print(f"Top 10 outperform overall average by: {top10_vs_overall:.1f}%")
Top 10 Stores Performance Analysis: ======================================================= Rank Store ID Avg Customers % of #1 Store ------------------------------------------------------- 1 733 3,403 100.0% 2 262 3,400 99.9% 3 562 3,107 91.3% 4 769 3,072 90.2% 5 1114 2,653 77.9% 6 817 2,605 76.5% 7 1097 2,412 70.9% 8 335 2,391 70.2% 9 259 2,334 68.6% 10 251 2,028 59.6% Summary Statistics: ------------------------- Total stores analyzed: 1115 Top 10 average: 2,740 customers Performance range: 1,375 customers Standard deviation: 477 Comparative Analysis: -------------------- Overall store average: 629 customers Top 10 outperform overall average by: 335.7%
3.7 Sales Analysis¶
Average SalesTrend per Day¶
In [20]:
# Define weekday order
weekday_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# Calculate mean customers by day
dow_agg = df_features.groupby('day', as_index=False)[['sales']].mean()
# Apply categorical ordering
dow_agg['day'] = pd.Categorical(dow_agg['day'], categories=weekday_order, ordered=True)
dow_agg = dow_agg.sort_values('day')
# Identify peak and lowest days
peak_row = dow_agg.loc[dow_agg['sales'].idxmax()]
lowest_row = dow_agg.loc[dow_agg['sales'].idxmin()]
peak_day = peak_row['day']
peak_value = peak_row['sales']
lowest_day = lowest_row['day']
lowest_value = lowest_row['sales']
# Create enhanced line chart
fig = px.line(
dow_agg,
x='day',
y='sales',
title='Average Sales Traffic by Day of Week',
markers=True,
line_shape='spline' # Smooth curve
)
# Enhanced styling
fig.update_layout(
title={
'text': 'Average Sales Traffic by Day of Week',
'x': 0.02,
'xanchor': 'left',
'y': 0.95,
'yanchor': 'top',
'font': {'size': 18, 'family': 'Arial, sans-serif'}
},
xaxis_title='Day of Week',
yaxis_title='Average Sales',
font=dict(size=12),
plot_bgcolor='white',
width=900,
height=500,
margin=dict(t=80, b=60, l=80, r=60),
xaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray'
),
yaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray',
tickformat=','
)
)
# Improve line and marker styling
fig.update_traces(
line=dict(width=3, color='#2E8B57'),
marker=dict(
size=8,
color='#FF6B6B',
line=dict(width=2, color='white')
)
)
# Add annotation for peak day
fig.add_annotation(
x=peak_day,
y=peak_value,
text=f"Peak: {peak_day}<br>({peak_value:,.0f} sales)",
showarrow=True,
arrowhead=2,
ax=0,
ay=-50,
font=dict(color="red", size=11),
bgcolor="rgba(255,255,255,0.9)",
bordercolor="red",
borderwidth=1
)
# Add annotation for lowest day
fig.add_annotation(
x=lowest_day,
y=lowest_value,
text=f"Lowest: {lowest_day}<br>({lowest_value:,.0f} sales)",
showarrow=True,
arrowhead=2,
ax=0,
ay=50,
font=dict(color="blue", size=11),
bgcolor="rgba(255,255,255,0.9)",
bordercolor="blue",
borderwidth=1
)
# Add summary statistics box
avg_sales = dow_agg['sales'].mean()
std_sales = dow_agg['sales'].std()
range_sales = peak_value - lowest_value
fig.add_annotation(
text=f"Weekly Average: {avg_customers:,.0f}<br>Standard Deviation: {std_customers:,.0f}<br>Range: {range_customers:,.0f}",
xref="paper", yref="paper",
x=0.98, y=0.98,
xanchor="right", yanchor="top",
showarrow=False,
font=dict(size=10, color="black"),
align="right",
bgcolor="rgba(255,255,255,0.9)",
bordercolor="gray",
borderwidth=1,
borderpad=8
)
fig.show(config={'displayModeBar': True, 'displaylogo': False})
# Print detailed summary
print("\nWeekday Sales Traffic Summary:")
print("-" * 45)
for _, row in dow_agg.iterrows():
day = row['day']
sales = row['sales']
percentage_of_peak = (sales / peak_value) * 100
print(f"{day}: {sales:,.0f} sales ({percentage_of_peak:.1f}% of peak)")
print(f"\nWeekly Statistics:")
print(f"Average daily Sales: {avg_sales:,.0f}")
print(f"Peak day: {peak_day} ({peak_value:,.0f} sales)")
print(f"Lowest day: {lowest_day} ({lowest_value:,.0f} sales)")
print(f"Difference (Peak - Lowest): {range_sales:,.0f} sales")
Weekday Sales Traffic Summary: --------------------------------------------- Mon: 7,798 sales (100.0% of peak) Tue: 7,006 sales (89.8% of peak) Wed: 6,536 sales (83.8% of peak) Thu: 6,216 sales (79.7% of peak) Fri: 6,704 sales (86.0% of peak) Sat: 5,857 sales (75.1% of peak) Sun: 203 sales (2.6% of peak) Weekly Statistics: Average daily Sales: 5,760 Peak day: Mon (7,798 sales) Lowest day: Sun (203 sales) Difference (Peak - Lowest): 7,595 sales
Holiday and Schoolday Impact¶
In [21]:
# Ensure month is ordered
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun','Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Calculate monthly averages
monthly_grp = df_features.groupby('month', as_index=False)['sales'].mean()
monthly_grp['month'] = pd.Categorical(monthly_grp['month'], categories=month_order, ordered=True)
monthly_grp = monthly_grp.sort_values('month')
# Identify peak and lowest months
peak_row = monthly_grp.loc[monthly_grp['sales'].idxmax()]
lowest_row = monthly_grp.loc[monthly_grp['sales'].idxmin()]
peak_month = peak_row['month']
peak_value = peak_row['sales']
lowest_month = lowest_row['month']
lowest_value = lowest_row['sales']
# Create enhanced line chart
fig = px.line(
monthly_grp,
x='month',
y='sales',
markers=True,
title='Average Sales Traffic by Month',
labels={'customers': 'Average Sales', 'month': 'Month'},
line_shape='spline' # Smooth curves for better trend visualization
)
# Enhanced styling
fig.update_layout(
title={
'text': 'Average Sales Traffic by Month',
'x': 0.02,
'xanchor': 'left',
'y': 0.95,
'yanchor': 'top',
'font': {'size': 18, 'family': 'Arial, sans-serif'}
},
xaxis_title='Month',
yaxis_title='Average Number of Sales',
font=dict(size=12),
plot_bgcolor='white',
width=1000,
height=550,
margin=dict(t=80, b=60, l=80, r=80),
xaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray'
),
yaxis=dict(
showgrid=True,
gridwidth=1,
gridcolor='lightgray',
tickformat=','
)
)
# Improve line and marker styling
fig.update_traces(
line=dict(width=3, color='#1f77b4'),
marker=dict(
size=10,
color='#ff7f0e',
line=dict(width=2, color='white')
)
)
# Annotate peak month
fig.add_annotation(
x=peak_month,
y=peak_value,
text=f'Peak: {peak_month}<br>({peak_value:,.0f} sales)',
showarrow=True,
arrowhead=2,
arrowsize=1,
arrowwidth=2,
arrowcolor='red',
font=dict(color='red', size=11),
yshift=20,
bgcolor="rgba(255,255,255,0.9)",
bordercolor="red",
borderwidth=1
)
# Annotate lowest month
fig.add_annotation(
x=lowest_month,
y=lowest_value,
text=f'Lowest: {lowest_month}<br>({lowest_value:,.0f} sales)',
showarrow=True,
arrowhead=2,
arrowsize=1,
arrowwidth=2,
arrowcolor='blue',
font=dict(color='blue', size=11),
yshift=-30,
bgcolor="rgba(255,255,255,0.9)",
bordercolor="blue",
borderwidth=1
)
# Add seasonal analysis
spring_months = ['Mar', 'Apr', 'May']
summer_months = ['Jun', 'Jul', 'Aug']
fall_months = ['Sep', 'Oct', 'Nov']
winter_months = ['Dec', 'Jan', 'Feb']
spring_avg = monthly_grp[monthly_grp['month'].isin(spring_months)]['sales'].mean()
summer_avg = monthly_grp[monthly_grp['month'].isin(summer_months)]['sales'].mean()
fall_avg = monthly_grp[monthly_grp['month'].isin(fall_months)]['sales'].mean()
winter_avg = monthly_grp[monthly_grp['month'].isin(winter_months)]['sales'].mean()
# Add summary statistics
yearly_avg = monthly_grp['sales'].mean()
yearly_std = monthly_grp['sales'].std()
range_customers = peak_value - lowest_value
fig.add_annotation(
text=f"Yearly Average: {yearly_avg:,.0f}<br>" +
f"Standard Deviation: {yearly_std:,.0f}<br>" +
f"Range: {range_customers:,.0f}<br><br>" +
f"Seasonal Averages:<br>" +
f"Spring: {spring_avg:,.0f}<br>" +
f"Summer: {summer_avg:,.0f}<br>" +
f"Fall: {fall_avg:,.0f}<br>" +
f"Winter: {winter_avg:,.0f}",
xref="paper", yref="paper",
x=0.02, y=0.5, # Middle left side
xanchor="left", yanchor="middle",
showarrow=False,
font=dict(size=10, color="black"),
align="left",
bgcolor="rgba(255,255,255,0.9)",
bordercolor="gray",
borderwidth=1,
borderpad=8
)
fig.show(config={'displayModeBar': True, 'displaylogo': False})
# Print detailed monthly summary
print("\nMonthly Sales Traffic Summary:")
print("-" * 50)
for _, row in monthly_grp.iterrows():
month = row['month']
sales = row['sales']
percentage_of_peak = (sales / peak_value) * 100
print(f"{month}: {sales:,.0f} sales ({percentage_of_peak:.1f}% of peak)")
print(f"\nYearly Statistics:")
print(f"Average monthly sales: {yearly_avg:,.0f}")
print(f"Peak month: {peak_month} ({peak_value:,.0f} sales)")
print(f"Lowest month: {lowest_month} ({lowest_value:,.0f} sales)")
print(f"Difference (Peak - Lowest): {range_customers:,.0f} sales")
print(f"\nSeasonal Analysis:")
print(f"Spring (Mar-May): {spring_avg:,.0f} sales")
print(f"Summer (Jun-Aug): {summer_avg:,.0f} sales")
print(f"Fall (Sep-Nov): {fall_avg:,.0f} sales")
print(f"Winter (Dec-Feb): {winter_avg:,.0f} sales")
Monthly Sales Traffic Summary: -------------------------------------------------- Jan: 5,465 sales (80.1% of peak) Feb: 5,645 sales (82.7% of peak) Mar: 5,785 sales (84.7% of peak) Apr: 5,739 sales (84.1% of peak) May: 5,490 sales (80.4% of peak) Jun: 5,761 sales (84.4% of peak) Jul: 6,023 sales (88.2% of peak) Aug: 5,693 sales (83.4% of peak) Sep: 5,570 sales (81.6% of peak) Oct: 5,537 sales (81.1% of peak) Nov: 6,008 sales (88.0% of peak) Dec: 6,827 sales (100.0% of peak) Yearly Statistics: Average monthly sales: 5,795 Peak month: Dec (6,827 sales) Lowest month: Jan (5,465 sales) Difference (Peak - Lowest): 1,361 sales Seasonal Analysis: Spring (Mar-May): 5,671 sales Summer (Jun-Aug): 5,826 sales Fall (Sep-Nov): 5,705 sales Winter (Dec-Feb): 5,979 sales
Promo Impact Summary¶
In [22]:
def clean_promo_analysis(df, sales_col='sales', customers_col='customers',
store_col='store', promo_col='promo', date_col='date', top_n=10):
"""
Clean and comprehensive promotional impact analysis
"""
print("🎯 PROMOTIONAL IMPACT ANALYSIS REPORT")
print("="*60)
# Data preprocessing
df_clean = df.copy()
# Remove closed stores (sales = 0)
df_clean = df_clean[df_clean[sales_col] > 0]
print(f"📊 Data Overview: {len(df_clean):,} records after removing closed days")
# Create binary promo flag
df_clean['promo_flag'] = (df_clean[promo_col] == 'Promo').astype(int)
# Get top stores by average sales
top_stores = df_clean.groupby(store_col)[sales_col].mean().nlargest(top_n).index
df_analysis = df_clean[df_clean[store_col].isin(top_stores)]
print(f"🏪 Analyzing top {len(top_stores)} stores: {list(top_stores)}")
print(f"📈 Analysis dataset: {len(df_analysis):,} records")
# Split data
promo_data = df_analysis[df_analysis['promo_flag'] == 1]
non_promo_data = df_analysis[df_analysis['promo_flag'] == 0]
print(f"🎯 Promotional days: {len(promo_data):,} ({len(promo_data)/len(df_analysis)*100:.1f}%)")
print(f"📅 Regular days: {len(non_promo_data):,} ({len(non_promo_data)/len(df_analysis)*100:.1f}%)")
# Calculate key metrics
results = {}
# Sales metrics
promo_avg_sales = promo_data[sales_col].mean()
non_promo_avg_sales = non_promo_data[sales_col].mean()
sales_lift = promo_avg_sales - non_promo_avg_sales
sales_lift_pct = (sales_lift / non_promo_avg_sales) * 100
# Customer metrics
promo_avg_customers = promo_data[customers_col].mean()
non_promo_avg_customers = non_promo_data[customers_col].mean()
customer_lift = promo_avg_customers - non_promo_avg_customers
customer_lift_pct = (customer_lift / non_promo_avg_customers) * 100
# Efficiency metrics
promo_sales_per_customer = promo_avg_sales / promo_avg_customers
non_promo_sales_per_customer = non_promo_avg_sales / non_promo_avg_customers
efficiency_improvement = ((promo_sales_per_customer - non_promo_sales_per_customer) /
non_promo_sales_per_customer) * 100
# Statistical test
t_stat, p_value = ttest_ind(promo_data[sales_col], non_promo_data[sales_col])
is_significant = p_value < 0.05
# Store-level analysis
store_results = []
for store in top_stores:
store_data = df_analysis[df_analysis[store_col] == store]
store_promo = store_data[store_data['promo_flag'] == 1]
store_regular = store_data[store_data['promo_flag'] == 0]
if len(store_promo) > 0 and len(store_regular) > 0:
store_sales_lift = ((store_promo[sales_col].mean() - store_regular[sales_col].mean()) /
store_regular[sales_col].mean()) * 100
store_customer_lift = ((store_promo[customers_col].mean() - store_regular[customers_col].mean()) /
store_regular[customers_col].mean()) * 100
store_results.append({
'Store': store,
'Promo Days': len(store_promo),
'Regular Days': len(store_regular),
'Promo Rate (%)': len(store_promo) / len(store_data) * 100,
'Sales Lift (%)': store_sales_lift,
'Customer Lift (%)': store_customer_lift,
'Promo Avg Sales': store_promo[sales_col].mean(),
'Regular Avg Sales': store_regular[sales_col].mean(),
'Promo Avg Customers': store_promo[customers_col].mean(),
'Regular Avg Customers': store_regular[customers_col].mean()
})
store_df = pd.DataFrame(store_results)
# Print results
print(f"\n💰 SALES PERFORMANCE ANALYSIS")
print(f"="*40)
print(f"🎯 Average Sales (Promotional): ${promo_avg_sales:,.0f}")
print(f"📊 Average Sales (Regular): ${non_promo_avg_sales:,.0f}")
print(f"⬆️ Absolute Sales Lift: ${sales_lift:,.0f}")
print(f"📈 Percentage Sales Lift: +{sales_lift_pct:.2f}%")
print(f"\n👥 CUSTOMER TRAFFIC ANALYSIS")
print(f"="*40)
print(f"🎯 Average Customers (Promotional): {promo_avg_customers:,.0f}")
print(f"📊 Average Customers (Regular): {non_promo_avg_customers:,.0f}")
print(f"⬆️ Customer Traffic Lift: +{customer_lift:.0f}")
print(f"📈 Customer Traffic Lift: +{customer_lift_pct:.2f}%")
print(f"\n🎯 EFFICIENCY & PROFITABILITY")
print(f"="*40)
print(f"💳 Sales per Customer (Promotional): ${promo_sales_per_customer:.2f}")
print(f"💳 Sales per Customer (Regular): ${non_promo_sales_per_customer:.2f}")
print(f"📊 Spending Efficiency Gain: +{efficiency_improvement:.2f}%")
print(f"\n📊 STATISTICAL VALIDATION")
print(f"="*40)
print(f"🧮 T-Statistic: {t_stat:.2f}")
print(f"📈 P-Value: {p_value:.6f}")
print(f"✅ Statistically Significant: {'YES' if is_significant else 'NO'} (α=0.05)")
# Business insights
print(f"\n💡 KEY BUSINESS INSIGHTS")
print(f"="*40)
if sales_lift_pct > 50:
print(f"🚀 EXCEPTIONAL PERFORMANCE: Promotions drive outstanding sales growth!")
recommendation = "MAXIMIZE promotional frequency - ROI is excellent"
elif sales_lift_pct > 25:
print(f"✅ STRONG PERFORMANCE: Promotions are highly effective")
recommendation = "INCREASE promotional activities strategically"
elif sales_lift_pct > 10:
print(f"👍 GOOD PERFORMANCE: Promotions show solid results")
recommendation = "MAINTAIN current promotional strategy"
elif sales_lift_pct > 0:
print(f"⚠️ WEAK PERFORMANCE: Minimal promotional benefit")
recommendation = "REVIEW promotional costs vs benefits"
else:
print(f"❌ NEGATIVE IMPACT: Promotions may be hurting performance")
recommendation = "URGENT REVIEW of promotional strategy needed"
print(f"📋 RECOMMENDATION: {recommendation}")
# Traffic vs Spending analysis
if customer_lift_pct > efficiency_improvement:
print(f"👥 PRIMARY DRIVER: Promotions mainly drive FOOT TRAFFIC (+{customer_lift_pct:.1f}%)")
print(f" → Focus on conversion and upselling during promotions")
elif efficiency_improvement > customer_lift_pct:
print(f"💰 PRIMARY DRIVER: Promotions increase SPENDING PER VISIT (+{efficiency_improvement:.1f}%)")
print(f" → Excellent basket size improvement")
else:
print(f"⚖️ BALANCED IMPACT: Both traffic and spending improve equally")
# Store performance insights
if not store_df.empty:
best_store = store_df.loc[store_df['Sales Lift (%)'].idxmax()]
worst_store = store_df.loc[store_df['Sales Lift (%)'].idxmin()]
print(f"\n🏆 TOP PERFORMING STORE: #{int(best_store['Store'])}")
print(f" 📈 Sales Lift: +{best_store['Sales Lift (%)']:.1f}%")
print(f" 👥 Customer Lift: +{best_store['Customer Lift (%)']:.1f}%")
print(f" 🎯 Promo Rate: {best_store['Promo Rate (%)']:.1f}%")
print(f"\n📉 LOWEST PERFORMING STORE: #{int(worst_store['Store'])}")
print(f" 📈 Sales Lift: +{best_store['Sales Lift (%)']:.1f}%")
print(f" 👥 Customer Lift: +{worst_store['Customer Lift (%)']:.1f}%")
print(f" 🎯 Promo Rate: {worst_store['Promo Rate (%)']:.1f}%")
avg_lift = store_df['Sales Lift (%)'].mean()
consistent_stores = ((store_df['Sales Lift (%)'] - avg_lift).abs() < 10).sum()
print(f"\n📊 CONSISTENCY ANALYSIS:")
print(f" 🎯 Average Lift Across Stores: +{avg_lift:.1f}%")
print(f" 📏 Performance Consistency: {consistent_stores}/{len(store_df)} stores within ±10%")
if consistent_stores / len(store_df) > 0.8:
print(f" ✅ HIGHLY CONSISTENT: Promotions work well across all stores")
elif consistent_stores / len(store_df) > 0.6:
print(f" 👍 MODERATELY CONSISTENT: Most stores benefit similarly")
else:
print(f" ⚠️ INCONSISTENT: Results vary significantly by store")
print(f" → Investigate store-specific factors affecting promotional performance")
# Time-based insights (if date available)
if date_col in df_analysis.columns:
df_analysis['month'] = pd.to_datetime(df_analysis[date_col]).dt.month_name()
df_analysis['weekday'] = pd.to_datetime(df_analysis[date_col]).dt.day_name()
# Monthly performance
monthly_promo = df_analysis[df_analysis['promo_flag']==1].groupby('month')[sales_col].mean()
monthly_regular = df_analysis[df_analysis['promo_flag']==0].groupby('month')[sales_col].mean()
monthly_lift = ((monthly_promo - monthly_regular) / monthly_regular * 100).round(1)
best_month = monthly_lift.idxmax()
worst_month = monthly_lift.idxmin()
print(f"\n📅 SEASONAL INSIGHTS:")
print(f" 🏆 Best Month for Promos: {best_month} (+{monthly_lift[best_month]:.1f}%)")
print(f" 📉 Worst Month for Promos: {worst_month} (+{monthly_lift[worst_month]:.1f}%)")
print(f"\n🎉 ANALYSIS COMPLETE!")
# Return structured results
return {
'summary_metrics': {
'sales_lift_pct': sales_lift_pct,
'customer_lift_pct': customer_lift_pct,
'efficiency_improvement': efficiency_improvement,
'statistical_significance': is_significant,
'p_value': p_value
},
'store_performance': store_df,
'raw_data': {
'promo_avg_sales': promo_avg_sales,
'regular_avg_sales': non_promo_avg_sales,
'promo_avg_customers': promo_avg_customers,
'regular_avg_customers': non_promo_avg_customers
}
}
def create_promo_visualization(results_dict, store_df):
"""
Create visualizations for promotional analysis
"""
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
# 1. Sales comparison
metrics = ['Promo', 'Regular']
sales_values = [results_dict['raw_data']['promo_avg_sales'],
results_dict['raw_data']['regular_avg_sales']]
bars1 = ax1.bar(metrics, sales_values, color=['#ff6b6b', '#4ecdc4'], alpha=0.8)
ax1.set_title('Average Sales: Promotional vs Regular Days', fontsize=14, fontweight='bold')
ax1.set_ylabel('Average Sales ($)')
# Add value labels on bars
for bar in bars1:
height = bar.get_height()
ax1.text(bar.get_x() + bar.get_width()/2., height,
f'${height:,.0f}', ha='center', va='bottom', fontsize=12)
# 2. Customer traffic comparison
customer_values = [results_dict['raw_data']['promo_avg_customers'],
results_dict['raw_data']['regular_avg_customers']]
bars2 = ax2.bar(metrics, customer_values, color=['#ff9f43', '#54a0ff'], alpha=0.8)
ax2.set_title('Average Customer Traffic: Promotional vs Regular Days', fontsize=14, fontweight='bold')
ax2.set_ylabel('Average Customers')
for bar in bars2:
height = bar.get_height()
ax2.text(bar.get_x() + bar.get_width()/2., height,
f'{height:,.0f}', ha='center', va='bottom', fontsize=12)
# 3. Store performance distribution
if not store_df.empty:
ax3.hist(store_df['Sales Lift (%)'], bins=8, alpha=0.7, color='#ff6b6b', edgecolor='black')
ax3.set_title('Distribution of Sales Lift Across Stores', fontsize=14, fontweight='bold')
ax3.set_xlabel('Sales Lift (%)')
ax3.set_ylabel('Number of Stores')
ax3.axvline(store_df['Sales Lift (%)'].mean(), color='red', linestyle='--',
label=f'Mean: {store_df["Sales Lift (%)"].mean():.1f}%')
ax3.legend()
# 4. Key metrics summary
lift_pct = results_dict['summary_metrics']['sales_lift_pct']
customer_lift_pct = results_dict['summary_metrics']['customer_lift_pct']
efficiency = results_dict['summary_metrics']['efficiency_improvement']
metrics_names = ['Sales Lift', 'Customer Lift', 'Efficiency Gain']
metrics_values = [lift_pct, customer_lift_pct, efficiency]
colors = ['#ff6b6b', '#4ecdc4', '#45aaf2']
bars4 = ax4.bar(metrics_names, metrics_values, color=colors, alpha=0.8)
ax4.set_title('Key Performance Metrics (%)', fontsize=14, fontweight='bold')
ax4.set_ylabel('Improvement (%)')
for bar in bars4:
height = bar.get_height()
ax4.text(bar.get_x() + bar.get_width()/2., height,
f'+{height:.1f}%', ha='center', va='bottom', fontsize=12)
plt.tight_layout()
plt.show()
return fig
results = clean_promo_analysis(df_features)
🎯 PROMOTIONAL IMPACT ANALYSIS REPORT ============================================================ 📊 Data Overview: 814,150 records after removing closed days 🏪 Analyzing top 10 stores: [817, 262, 1114, 251, 842, 513, 562, 788, 383, 756] 📈 Analysis dataset: 7,702 records 🎯 Promotional days: 3,325 (43.2%) 📅 Regular days: 4,377 (56.8%) 💰 SALES PERFORMANCE ANALYSIS ======================================== 🎯 Average Sales (Promotional): $20,779 📊 Average Sales (Regular): $17,473 ⬆️ Absolute Sales Lift: $3,306 📈 Percentage Sales Lift: +18.92% 👥 CUSTOMER TRAFFIC ANALYSIS ======================================== 🎯 Average Customers (Promotional): 2,636 📊 Average Customers (Regular): 2,469 ⬆️ Customer Traffic Lift: +167 📈 Customer Traffic Lift: +6.78% 🎯 EFFICIENCY & PROFITABILITY ======================================== 💳 Sales per Customer (Promotional): $7.88 💳 Sales per Customer (Regular): $7.08 📊 Spending Efficiency Gain: +11.37% 📊 STATISTICAL VALIDATION ======================================== 🧮 T-Statistic: 39.17 📈 P-Value: 0.000000 ✅ Statistically Significant: YES (α=0.05) 💡 KEY BUSINESS INSIGHTS ======================================== 👍 GOOD PERFORMANCE: Promotions show solid results 📋 RECOMMENDATION: MAINTAIN current promotional strategy 💰 PRIMARY DRIVER: Promotions increase SPENDING PER VISIT (+11.4%) → Excellent basket size improvement 🏆 TOP PERFORMING STORE: #817 📈 Sales Lift: +32.8% 👥 Customer Lift: +21.1% 🎯 Promo Rate: 44.9% 📉 LOWEST PERFORMING STORE: #262 📈 Sales Lift: +32.8% 👥 Customer Lift: +-3.4% 🎯 Promo Rate: 38.1% 📊 CONSISTENCY ANALYSIS: 🎯 Average Lift Across Stores: +19.7% 📏 Performance Consistency: 8/10 stores within ±10% 👍 MODERATELY CONSISTENT: Most stores benefit similarly 📅 SEASONAL INSIGHTS: 🏆 Best Month for Promos: December (+27.0%) 📉 Worst Month for Promos: November (+11.2%) 🎉 ANALYSIS COMPLETE!
In [23]:
fig = create_promo_visualization(results, results['store_performance'])